Modeling the visual changes that an action brings to a scene is critical for video understanding. CNNs currently process one local neighbourhood at a time, so contextual relationships over longer ranges, while still learnable, are captured only indirectly. We present TROI, a plug-and-play module for CNNs to reason between mid-level feature representations that are otherwise separated in space and time. The module relates localized visual entities such as hands and interacting objects and transforms their corresponding regions of interest directly in the feature maps of convolutional layers. With TROI, we achieve state-of-the-art action recognition results on the large-scale datasets Something-Something-V2 and EPIC-Kitchens-100.
Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the \emph{Tasty Videos Dataset V2}, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.
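We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. A naive merge of the constituent datasets yields poor performance because their taxonomies and annotation practices are inconsistent. We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images, which required more than 1.34 years of collective annotator effort. The resulting composite dataset makes it possible to train a single semantic segmentation model that functions effectively across domains and generalizes to datasets not seen during training. We adopt zero-shot cross-dataset transfer as a benchmark to systematically evaluate a model's robustness, and show that MSeg training yields substantially more robust models than training on individual datasets or on a naive mix of datasets without the presented contributions. A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training. We evaluate our models in the 2020 Robust Vision Challenge (RVC) as an extreme generalization experiment. The MSeg training set includes only three of the seven datasets in the RVC; more importantly, the RVC evaluation taxonomy is different and more detailed. Surprisingly, our model shows competitive performance and ranks second. To evaluate how close we are to the grand aim of robust, efficient, and complete scene understanding, we go beyond semantic segmentation by training instance segmentation and panoptic segmentation models on our dataset. We also evaluate various engineering design decisions and metrics, including resolution and computational efficiency. Although our models are far from this grand aim, our comprehensive evaluation is crucial for progress. We share all models and code with the community.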
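As a rough illustration of why taxonomies must be reconciled before datasets are merged, the sketch below remaps per-dataset label ids into one shared label space. Every class name, id, and helper here (`UNIFIED`, `DATASET_A`, `DATASET_B`, `TO_UNIFIED`, `build_lookup`, `remap`) is a hypothetical assumption for illustration and is not taken from the MSeg taxonomy or code. A lookup of this kind only covers classes that can be renamed or merged; cases where one source class must be split across several unified classes are the ones that call for manual mask relabeling, as the abstract describes.

```python
# Hypothetical unified taxonomy; names are illustrative assumptions only.
UNIFIED = ["road", "person", "vehicle", "unlabeled"]
UNIFIED_ID = {name: i for i, name in enumerate(UNIFIED)}

# Two hypothetical source datasets with inconsistent class names and ids.
DATASET_A = {0: "road", 1: "pedestrian", 2: "car"}
DATASET_B = {0: "street", 1: "person", 2: "truck", 3: "rider"}

# Name-level mapping from source classes to unified classes.
TO_UNIFIED = {
    "road": "road", "street": "road",
    "pedestrian": "person", "person": "person", "rider": "person",
    "car": "vehicle", "truck": "vehicle",
}


def build_lookup(dataset_classes):
    """Return {source_id: unified_id}; unmapped classes become 'unlabeled'."""
    return {
        src_id: UNIFIED_ID[TO_UNIFIED.get(name, "unlabeled")]
        for src_id, name in dataset_classes.items()
    }


def remap(label_map, lookup):
    """Remap a 2D list of source label ids to unified label ids."""
    unlabeled = UNIFIED_ID["unlabeled"]
    return [[lookup.get(px, unlabeled) for px in row] for row in label_map]


# Tiny 2x3 label maps standing in for segmentation masks.
mask_a = [[0, 0, 2], [1, 1, 2]]
mask_b = [[0, 3, 2], [1, 1, 0]]
unified_a = remap(mask_a, build_lookup(DATASET_A))
unified_b = remap(mask_b, build_lookup(DATASET_B))
```

After remapping, both masks use the same ids for the same concept, so a single model could be trained on their union.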
We are concerned with learning models that generalize well to different unseen domains. We consider a worst-case formulation over data distributions that are near the source domain in the feature space. Using only training data from a single source distribution, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method in which we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers that regularize towards zero (e.g., ridge or lasso). On digit recognition and semantic segmentation tasks, our method learns models that improve performance across a range of a priori unknown target domains.
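The iterative procedure described above can be sketched in a few lines, under heavy simplifying assumptions. This toy version uses a two-feature logistic regression instead of a deep network, and it penalizes distance to the source example in input space rather than in feature space as the abstract states. All names and hyperparameters (`perturb`, `adversarial_augmentation`, `gamma`, `rounds`, `k`) are hypothetical and not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip the logit for numerical safety
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, y):
    p = min(max(predict(w, b, x), 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def train(w, b, data, lr=0.1, epochs=50):
    """Plain SGD on the logistic loss over the current (augmented) dataset."""
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, b, x) - y  # derivative of the loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def perturb(w, b, x0, y, gamma=1.0, lr=0.1, steps=15):
    """Gradient ascent in x on loss(x, y) - gamma * ||x - x0||^2.

    Assumption: the penalty is in input space, a simplification of the
    feature-space distance mentioned in the abstract.
    """
    x = list(x0)
    for _ in range(steps):
        err = predict(w, b, x) - y
        grad = [err * wi - 2.0 * gamma * (xi - x0i)
                for wi, xi, x0i in zip(w, x, x0)]
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x


def adversarial_augmentation(source, rounds=3, k=4, seed=0):
    """Alternate between training and appending k 'hard' examples."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    data = list(source)
    for _ in range(rounds):
        w, b = train(w, b, data)
        for x0, y in rng.sample(source, k):
            data.append((perturb(w, b, x0, y), y))  # label is kept unchanged
    w, b = train(w, b, data)
    return w, b, data


# Hypothetical single-source training set: two well-separated classes.
source = [([1.0, 1.2], 1), ([0.8, 1.5], 1), ([1.3, 0.9], 1), ([1.1, 1.1], 1),
          ([-1.0, -1.1], 0), ([-0.9, -1.4], 0), ([-1.2, -0.8], 0), ([-1.1, -1.0], 0)]
w, b, augmented = adversarial_augmentation(source)
```

The penalty weight `gamma` controls how far the fictitious examples may drift from the source data: a larger value keeps them closer, a smaller one allows harder but less faithful examples.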